Instructions

The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.

Introduction

This tutorial will focus on analysing the updated data of the worldwide Novel Corona virus (COVID-19) pandemic.
There are several data sources available online. We will use the data collected from a range of official sources and hosted on the Our World in Data website (Mathieu et al. 2021).

Analyse Data in R

To run R and RStudio on Binder, click on this badge - Launch Rstudio Binder.

Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.

Install Extra Packages

For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computers) to try and avoid having to download the entire tidyverse. In addition to the Tidyverse packages we’ve got to know in the previous workshop we will use the plotly package to create interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file, scales to format large numbers, lubridate to better handle dates, glue to paste together strings, patchwork to include several plots in a single figure and a few others to assist in getting the data into shape.
To use these packages, we will introduce a package called pacman that will assist in loading the required packages and installing them if they’re not already installed. To install it we use the install.packages('pacman') command, please note that the package name need to be quoted and that we only need to perform it once, or when we want or need to update the package. Once the package was installed, we can load its functions using the library(pacman) command and then load/install all the other packages at once with p_load() function.

# install required packages - needed only once! (comment with a # after first use)
install.packages('pacman')
# load required packages
library(pacman)
p_load(tidyverse,  paletteer, glue, scales, plotly, lubridate, patchwork, visdat)

More information on installing and using R packages can be found in this tutorial.

Read Data

Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.).

We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data directly from a file on the Our World in Data website into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R, which won’t automatically parse columns containing dates and in previous versions of R (<4.0) will slightly change the structure of the resulting data frame (all text columns will be converted into factors).
> Note that in this case, we need to specify the column types because the data contains a lot of missing values that interfere with the automatic parsing of the column types._

# read data from Our World in Data website
covid_data <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv", 
                       col_types = paste(c("c", "f", "c", "D", rep("d", 29), 
                                           "c", rep("d", 33)), collapse = ""))

Data Exploration

Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column (see detailed information on each variable in the data repository on GitHub):

#explore the data frame
head(covid_data) # show first 10 rows of the data and type of variables
## # A tibble: 6 × 67
##   iso_code continent location  date       total_cases new_cases new_cases_smoot…
##   <chr>    <fct>     <chr>     <date>           <dbl>     <dbl>            <dbl>
## 1 AFG      Asia      Afghanis… 2020-02-24           5         5           NA    
## 2 AFG      Asia      Afghanis… 2020-02-25           5         0           NA    
## 3 AFG      Asia      Afghanis… 2020-02-26           5         0           NA    
## 4 AFG      Asia      Afghanis… 2020-02-27           5         0           NA    
## 5 AFG      Asia      Afghanis… 2020-02-28           5         0           NA    
## 6 AFG      Asia      Afghanis… 2020-02-29           5         0            0.714
## # … with 60 more variables: total_deaths <dbl>, new_deaths <dbl>,
## #   new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## #   new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## #   total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## #   new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## #   icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## #   hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, …

Descriptive Statistics

We can also produce some descriptive statistics to better understand the data and the nature of each variable. The skim() function (from the skimr package) provides an alternative to the base summary() function and produces a summary of basic descriptive statistics, such as the mean, min, max, quantiles and even the SD and rate of missing data for continuous numerical values.

# summary of variables in my data
# summary(covid_data)
skim(covid_data)
Table 1: Data summary
Name covid_data
Number of rows 206852
Number of columns 67
_______________________
Column type frequency:
character 3
Date 1
factor 1
numeric 62
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
iso_code 0 1.00 3 8 0 244 0
location 0 1.00 4 32 0 244 0
tests_units 100066 0.52 13 15 0 4 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2020-01-01 2022-08-07 2021-05-31 950

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
continent 0 1 FALSE 7 Afr: 47888, Eur: 44907, Asi: 44592, Nor: 31958

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cases 8381 0.96 3817978.55 2.396678e+07 1.00 3401.00 41533.00 446271.00 5.841024e+08 ▇▁▁▁▁
new_cases 8666 0.96 12581.45 8.730694e+04 0.00 0.00 65.00 1013.00 4.079467e+06 ▇▁▁▁▁
new_cases_smoothed 9843 0.95 12578.91 8.522271e+04 0.00 6.71 100.43 1149.43 3.437955e+06 ▇▁▁▁▁
total_deaths 27155 0.87 69051.36 3.577979e+05 1.00 103.00 1032.00 9090.00 6.417879e+06 ▇▁▁▁▁
new_deaths 27198 0.87 148.22 7.611400e+02 0.00 0.00 1.00 16.00 1.819100e+04 ▇▁▁▁▁
new_deaths_smoothed 28365 0.86 148.93 7.446800e+02 0.00 0.14 1.71 17.29 1.481714e+04 ▇▁▁▁▁
total_cases_per_million 9293 0.96 50068.28 9.138295e+04 0.00 909.94 8158.78 61425.71 6.553093e+05 ▇▁▁▁▁
new_cases_per_million 9578 0.95 188.93 9.174400e+02 0.00 0.00 9.24 102.51 1.950053e+05 ▇▁▁▁▁
new_cases_smoothed_per_million 10750 0.95 188.87 5.993700e+02 0.00 1.49 19.93 135.62 3.525884e+04 ▇▁▁▁▁
total_deaths_per_million 28054 0.86 637.81 9.258300e+02 0.00 26.35 184.22 946.65 6.364670e+03 ▇▁▁▁▁
new_deaths_per_million 28097 0.86 1.50 5.260000e+00 0.00 0.00 0.06 1.10 5.538000e+02 ▇▁▁▁▁
new_deaths_smoothed_per_million 29259 0.86 1.50 3.450000e+00 0.00 0.00 0.23 1.49 1.486700e+02 ▇▁▁▁▁
reproduction_rate 51822 0.75 0.97 3.800000e-01 -0.05 0.77 0.98 1.17 5.860000e+00 ▇▃▁▁▁
icu_patients 179515 0.13 859.31 2.501980e+03 0.00 33.00 170.00 619.00 2.889100e+04 ▇▁▁▁▁
icu_patients_per_million 179515 0.13 22.24 2.672000e+01 0.00 4.08 11.81 31.58 1.803900e+02 ▇▂▁▁▁
hosp_patients 177595 0.14 4210.63 1.097507e+04 0.00 163.00 814.00 3202.00 1.545130e+05 ▇▁▁▁▁
hosp_patients_per_million 177595 0.14 160.29 1.958400e+02 0.00 31.38 89.38 207.72 1.546500e+03 ▇▁▁▁▁
weekly_icu_admissions 200074 0.03 439.40 5.949800e+02 0.00 42.00 206.00 603.00 4.838000e+03 ▇▁▁▁▁
weekly_icu_admissions_per_million 200074 0.03 13.71 1.544000e+01 0.00 3.66 8.69 17.93 2.229000e+02 ▇▁▁▁▁
weekly_hosp_admissions 193415 0.06 5637.07 1.362218e+04 0.00 372.00 1339.00 5376.00 1.539880e+05 ▇▁▁▁▁
weekly_hosp_admissions_per_million 193415 0.06 99.81 1.004500e+02 0.00 27.50 72.50 134.91 6.589400e+02 ▇▂▁▁▁
total_tests 127467 0.38 21093607.55 8.404015e+07 0.00 364660.00 2067330.00 10248295.00 9.214000e+09 ▇▁▁▁▁
new_tests 131451 0.36 67283.91 2.477363e+05 1.00 2244.00 8783.00 37228.00 3.585563e+07 ▇▁▁▁▁
total_tests_per_thousand 127467 0.38 916.22 2.168240e+03 0.00 43.77 234.26 891.93 3.292590e+04 ▇▁▁▁▁
new_tests_per_thousand 131451 0.36 3.24 8.970000e+00 0.00 0.29 0.97 2.91 5.340100e+02 ▇▁▁▁▁
new_tests_smoothed 102889 0.50 142176.50 1.138225e+06 0.00 1486.00 6570.00 32204.00 1.476998e+07 ▇▁▁▁▁
new_tests_smoothed_per_thousand 102889 0.50 2.80 7.260000e+00 0.00 0.20 0.86 2.58 1.476000e+02 ▇▁▁▁▁
positive_rate 110932 0.46 0.10 1.200000e-01 0.00 0.02 0.06 0.14 1.000000e+00 ▇▁▁▁▁
tests_per_case 112794 0.45 2397.00 3.349475e+04 1.00 7.10 17.50 53.70 1.023632e+06 ▇▁▁▁▁
total_vaccinations 149596 0.28 258320512.19 1.079031e+09 0.00 942245.75 7791557.50 51941563.25 1.239910e+10 ▇▁▁▁▁
people_vaccinated 152217 0.26 122220465.03 5.093437e+08 0.00 536085.00 4311020.00 25961193.00 5.311033e+09 ▇▁▁▁▁
people_fully_vaccinated 154715 0.25 103974711.49 4.481811e+08 1.00 458891.00 3854304.00 22353415.00 4.880663e+09 ▇▁▁▁▁
total_boosters 177954 0.14 49079214.10 2.037780e+08 1.00 52759.25 1670522.50 12824204.75 2.296682e+09 ▇▁▁▁▁
new_vaccinations 159924 0.23 1026683.71 3.854329e+06 0.00 4761.75 36923.50 269547.25 4.966707e+07 ▇▁▁▁▁
new_vaccinations_smoothed 91004 0.56 433909.63 2.463675e+06 0.00 701.00 7009.50 53783.25 4.368788e+07 ▇▁▁▁▁
total_vaccinations_per_hundred 149596 0.28 93.52 7.560000e+01 0.00 20.79 85.12 150.95 3.668700e+02 ▇▅▃▁▁
people_vaccinated_per_hundred 152217 0.26 44.73 3.010000e+01 0.00 14.23 48.80 71.82 1.287800e+02 ▇▅▇▅▁
people_fully_vaccinated_per_hundred 154715 0.25 39.69 2.932000e+01 0.00 9.45 41.17 65.80 1.267900e+02 ▇▃▆▂▁
total_boosters_per_hundred 177954 0.14 22.02 2.360000e+01 0.00 0.53 12.07 40.15 1.342700e+02 ▇▃▂▁▁
new_vaccinations_smoothed_per_million 91004 0.56 2730.81 3.655220e+03 0.00 395.00 1466.50 3810.00 1.178620e+05 ▇▁▁▁▁
new_people_vaccinated_smoothed 91903 0.56 165641.43 1.010110e+06 0.00 210.00 2386.00 19121.00 2.106905e+07 ▇▁▁▁▁
new_people_vaccinated_smoothed_per_hundred 91903 0.56 0.12 2.200000e-01 0.00 0.01 0.04 0.14 1.179000e+01 ▇▁▁▁▁
stringency_index 51666 0.75 50.24 2.168000e+01 0.00 35.19 49.92 67.59 1.000000e+02 ▂▆▇▆▂
population 1229 0.99 143216682.11 6.961334e+08 47.00 896007.00 7494578.00 33573874.00 7.909295e+09 ▇▁▁▁▁
population_density 22778 0.89 457.35 2.109000e+03 0.14 37.73 88.12 214.24 2.054677e+04 ▇▁▁▁▁
median_age 36286 0.82 30.64 9.070000e+00 15.10 22.30 30.60 39.10 4.820000e+01 ▇▆▇▆▆
aged_65_older 38089 0.82 8.82 6.130000e+00 1.14 3.53 6.70 14.18 2.705000e+01 ▇▃▂▂▁
aged_70_older 37179 0.82 5.56 4.160000e+00 0.53 2.06 4.21 8.68 1.849000e+01 ▇▃▂▂▁
gdp_per_capita 37275 0.82 19602.11 2.056391e+04 661.24 4449.90 12951.84 27936.90 1.169356e+05 ▇▂▁▁▁
extreme_poverty 96239 0.53 13.63 2.003000e+01 0.10 0.60 2.20 21.20 7.760000e+01 ▇▁▁▁▁
cardiovasc_death_rate 36710 0.82 260.96 1.199300e+02 79.37 170.05 243.96 329.94 7.244200e+02 ▇▇▃▁▁
diabetes_prevalence 28270 0.86 8.37 4.690000e+00 0.99 5.31 7.20 10.59 3.053000e+01 ▇▇▂▁▁
female_smokers 78160 0.62 10.66 1.060000e+01 0.10 1.90 6.30 19.30 4.400000e+01 ▇▂▂▁▁
male_smokers 79923 0.61 32.80 1.353000e+01 7.70 21.60 31.40 41.30 7.810000e+01 ▆▇▆▂▁
handwashing_facilities 123295 0.40 50.88 3.185000e+01 1.19 20.86 49.84 82.50 1.000000e+02 ▇▅▅▅▇
hospital_beds_per_thousand 55673 0.73 3.09 2.550000e+00 0.10 1.30 2.50 4.00 1.380000e+01 ▇▃▂▁▁
life_expectancy 13348 0.94 73.65 7.450000e+00 53.28 69.50 75.05 79.07 8.675000e+01 ▁▃▅▇▅
human_development_index 41146 0.80 0.73 1.500000e-01 0.39 0.60 0.74 0.84 9.600000e-01 ▂▅▅▇▇
excess_mortality_cumulative_absolute 199831 0.03 44438.78 1.238340e+05 -37726.10 12.10 5114.90 31461.00 1.219078e+06 ▇▁▁▁▁
excess_mortality_cumulative 199831 0.03 9.62 1.377000e+01 -28.45 0.18 7.07 15.10 7.655000e+01 ▁▇▂▁▁
excess_mortality 199813 0.03 14.73 2.708000e+01 -95.92 -0.30 7.25 20.64 3.759800e+02 ▂▇▁▁▁
excess_mortality_cumulative_per_million 199831 0.03 1219.06 1.628300e+03 -1694.15 7.01 679.86 1915.23 9.726080e+03 ▇▆▂▁▁
# find extreme rows
covid_data %>% arrange(desc(total_cases))
## # A tibble: 206,852 × 67
##    iso_code continent location date       total_cases new_cases new_cases_smoot…
##    <chr>    <fct>     <chr>    <date>           <dbl>     <dbl>            <dbl>
##  1 OWID_WRL ""        World    2022-08-06   584102431   1174416         1039605.
##  2 OWID_WRL ""        World    2022-08-05   582928015    912410          953741.
##  3 OWID_WRL ""        World    2022-08-04   582015605   1429455         1019857.
##  4 OWID_WRL ""        World    2022-08-03   580597930   1146509          969115.
##  5 OWID_WRL ""        World    2022-08-02   579451421   1170931         1006328.
##  6 OWID_WRL ""        World    2022-08-01   578280490    910073         1006651.
##  7 OWID_WRL ""        World    2022-07-31   577370417    533440         1023214.
##  8 OWID_WRL ""        World    2022-07-30   576836977    573370         1025368.
##  9 OWID_WRL ""        World    2022-07-29   576263607   1375222         1034208.
## 10 OWID_WRL ""        World    2022-07-28   574888391   1074261          991894.
## # … with 206,842 more rows, and 60 more variables: total_deaths <dbl>,
## #   new_deaths <dbl>, new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## #   new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## #   total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## #   new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## #   icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## #   hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, …

What are the metadata columns that describe our observations?

continent 
location  
date

Why do we have observations with missing continent?

# check which location have continent as NA
covid_data %>% filter(continent=="" | is.na(continent)) %>% count(location)
## # A tibble: 13 × 2
##    location                n
##    <chr>               <int>
##  1 Africa                906
##  2 Asia                  928
##  3 Europe                927
##  4 European Union        927
##  5 High income           928
##  6 International         912
##  7 Low income            896
##  8 Lower middle income   928
##  9 North America         928
## 10 Oceania               925
## 11 South America         897
## 12 Upper middle income   928
## 13 World                 928
Some rows contain summarised data of entire continents/World, we'll need to remove those

We can see that most of our data contains ‘0’ (check the difference between the median and the mean in total_cases and total_deaths columns). Just to confirm that, let’s plot a histogram of all the confirmed cases

ggplot(covid_data, aes(x=total_cases)) +
  geom_histogram(fill="lightskyblue") +
  theme_bw(def_text_size)

The data is evolving over days (a time-series), to there’s no point treating it as a random population sample.

Time-series plot

Let’s look at confirmed cases and total deaths data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:

  1. First we remove all observations for combined continents with filter(!is.na(continent)
  2. Then we group it by location with group_by()
  3. Then we sort it within each location by date (from latest to earliest) with arrange(desc(date))
  4. We select just the most recent data point from each location with slice(1) and remove grouping with ungroup()
  5. Next we arrange it by descending order of total cases and select the top 10 observations (one for each location)
  6. Finally, we subset our original data to contain just the countries from our vector with inner_join()

Optional step:

  1. We can recode the location variable as a factor and order it so the countries will be ordered in the legend by the number of cases

Then we can look at the data as a table and make a plot with the number of cases in the y-axis and date in the x-axis.

# find the 10 most affected countries (to date)
latest_data <- covid_data %>% filter(!is.na(continent), continent!="") %>% 
  group_by(location) %>% arrange(desc(date)) %>% slice(1) %>% ungroup() 
most_affected_countries <- latest_data  %>%  
  arrange(desc(total_cases)) %>% slice(1:10) %>% 
  select(location)

# subset just the data from the 10 most affected countries and order them from the most affected to the least one
most_affected_data <- covid_data %>% 
  inner_join(most_affected_countries) %>% 
  mutate(Country=factor(location, levels = most_affected_countries$location))

# create a line plot the data of total cases
ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  labs(color="Country", y = "Total COVID-19 cases") +
  theme_bw(def_text_size)

We can look at the exponential increase at the early days of the pandemic if we transform the y-axis to logarithmic scale (and also improve how of the dates appear in the x-axis).

# better formatting of date axis, log scale 
plot <- ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
  geom_line(size=0.75) + scale_y_log10(labels=comma) + 
  scale_x_date(NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  labs(color="Country", y = "Total COVID-19 cases") +
  theme_bw(def_text_size)
# show an interactive plot
ggplotly(plot)

Why did we get a warning message and why the graphs don’t start at the bottom of the x-axis? How can we solve it? What can we infer from the graph (exponential increase)?

What happens when we take the log of 0?? Can we remove those 0s with the `filter()` function (or add a very small number to them)?
We can see a very similar trend for most countries and while the curve has flattened substantially in 2021, the numbers are still rising. It is also evident that Europe got hit by a second wave arount October 2020 and then again around Feb 2022 (with a huge rise in cases in South Korea).

Vaccination rates

Let’s have a look at the number of vaccinated people.

# vaccination rates
ggplot(most_affected_data, aes(x=date, y=people_vaccinated, colour=Country)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  scale_x_date(NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y="People Vaccinated") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())

The graphs are “broken”, meaning that it is not continuous and we have some missing data.
Let’s visualise some of the variables in our data and assess “missingness”.

# visualise missingness
vis_dat(covid_data %>% filter(date>dmy("01-01-2022")) %>% 
          select(continent, location, total_cases, total_deaths, 
                 hosp_patients, people_vaccinated, people_fully_vaccinated))

# find which countries has the most number of observations (least missing data)
# covid_data %>% filter(!is.na(continent), continent!="", !is.na(people_vaccinated)) %>% # group_by(location) %>% 
#   count(location) %>% arrange(desc(n)) %>% print(n=30)
# 
# covid_data %>% filter(!is.na(continent), continent!="", !is.na(hosp_patients)) %>% # group_by(location) %>% 
#   count(location) %>% arrange(desc(n)) %>% print(n=30)

Hospitalisation and Vaccination rates

Now we can focus on a subset of countries that have more complete vaccination and hospitalisation rates data, so we could compare them.

countries_subset <- c("Italy", "Canada", "Israel", "United Kingdom",  "France", "Australia", "Malaysia", 'South Africa', "India")
# subset our original data to these countries
hosp_data <- covid_data %>% filter(location %in% countries_subset)
# define a new colour palette for these countries
col_pal <- "ggsci::category10_d3"

Let’s look at hospitalisation rates first.

ggplot(hosp_data, aes(x=date, y = hosp_patients,colour=location)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y = "Hospitalised patients") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())

Can you identify the “waves” in each country?
It’s hard to see the details in the countries with lower number of hospitalised patients, how can we improve the visualisation?

Look at hospitalision rates proportional to the population size!
# hosp per population size
p1 <- ggplot(hosp_data, 
       aes(x=date, y = hosp_patients_per_million,colour=location)) +
  geom_line(size=0.75) + 
  scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y = "Hospitalised patients (per million)") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())
p1

Now let’s try to compare this to vaccination rates

# total vaccination per population
p2 <- ggplot(hosp_data, 
             aes(x=date, y = people_fully_vaccinated_per_hundred/100 ,colour=location)) +
  geom_line(size=0.75, linetype="dashed") + 
  guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
  scale_y_continuous(labels=percent) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y = "Rate of fully vaccinated people") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())
p2

What will be the best way to compare these values? Maybe like this (note that this syntax is used by patchwork package to arrange the plots):

(p1 + guides(color="none")) / (p2 + theme(legend.position = "bottom")) #+ plot_layout(guides = 'collect')# show graphs on top of each other

Any other suggestions? Let’s try to present them on the same graph (note the trick with the secondary y-axis), but for each country separately (done with the facet_wrap() function)..

# show on the same graph
p4 <- ggplot(hosp_data, 
       aes(x=date, colour=location)) +
  geom_line(aes(y = hosp_patients_per_million), size=0.75) + 
  geom_line(aes(y = people_fully_vaccinated_per_hundred*10), size=0.75, linetype="dashed") + 
  scale_y_continuous(name = "Hospitalised patients per million (solid)",
                     # Add a second axis and specify its features
                     sec.axis = sec_axis(trans=~./10,  
                                         name="Fully vaccinated per hundred (dashed)")) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(NULL,
               breaks = breaks_width("6 months"), 
               labels = label_date_short()) + 
  labs(color="Country") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank()) + 
  facet_wrap(~location)
p4

Save the plot to a file.

# create output folder
dir.create("./output", showWarnings = FALSE)
# save the plot to pdf file
ggsave("output/hospit_vacc_rates_facet_country.pdf", width=14, height = 8)

Questions

  1. What other variables we could analyse?
  2. Any correlated variables?
  3. What we should take into account that might bias the results or the true status of the pandemic?
1. Mortalities (Case Fatality Rate)?  
2. Suggestions? (cases per population density, vaccination rates by country income, deaths by number of beds per capita, etc.
3. Level of reporting in each country...  

Additional Resources

Now take the plunge and start practicing with your own data!!

tweet_embed("https://twitter.com/YannisMarkonis/status/1276211134186602499")

Contact

Please contact me at i.bar@griffith.edu.au for any questions or comments.

References

Mathieu E, Ritchie H, Ortiz-Ospina E, et al. (2021) A global database of COVID-19 vaccinations. Nat Hum Behav 5:947–953. doi: 10.1038/s41562-021-01122-8